As discussed in class and the syllabus, you will be working on final projects throughout the rest of the semester, to be submitted during finals week. See here for many examples of last semester’s final projects.
For this week, all you have to do is pick your final project team. All teams should consist of 3-4 students. You have two options for picking a team:
Amongst yourselves, organize a team of 3-4 students that you would like to work with for the final project. Once your team has assembled, email me (as a team) at zach@stat.cmu.edu telling me your team.
Email me at zach@stat.cmu.edu telling me that you would like to be randomized to a team. If you find only one other student you’d like to work with, you (as a pair) can email me telling me you (as a pair) would like to be randomized to a team.
All you have to do to get the 10 points for this part is email me by class on Wednesday, October 27 with this information. Thus, you have a bit more time after the homework deadline to do this, but the sooner the better. This should be an easy 10 points!
In this problem, we will use a dataset on students’ academic performance, found here:
data = read.csv("https://raw.githubusercontent.com/zjbranson/315Fall2021/master/students.csv")
Details about the dataset are found here. However, the main things you need to know about this dataset are:
Grade is classified as Low (L), Medium (M), or High (H).
data.subset, which contains only the following variables:
RaisedHands
VisitedResources
AnnouncementsView
Discussion
Gender
Grade
After you’ve made data.subset, use the ggpairs function to make a pairs plot of the quantitative variables in data.subset (i.e., the first four variables in the above list). Your plot should be a 4x4 grid, 6 of which are scatterplots. Don’t worry about changing the title/labels.
#creating the desired subset
data.subset = subset(data,
select = c("RaisedHands", "VisitedResources", "AnnouncementsView", "Discussion", "Gender", "Grade"))
#creating a pairs plot of just quantitative variables
library(GGally)
ggpairs(data = data.subset, columns = 1:4)
After you’ve made your plot, answer the following questions:
VisitedResources and RaisedHands have the highest correlation (0.692).
VisitedResources and Discussion have the lowest correlation (0.243).
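The highest and lowest correlations can also be read off numerically instead of from the ggpairs panels. Below is a minimal sketch on simulated stand-in data; the data frame sim, the latent engagement variable, and all values in it are assumptions for illustration, not the course dataset.

```r
# Simulated stand-in data (hypothetical; NOT the students dataset)
set.seed(315)
n = 200
engagement = rnorm(n)  # hypothetical latent trait driving three of the variables
sim = data.frame(
  RaisedHands       = engagement + rnorm(n, sd = 0.5),
  VisitedResources  = engagement + rnorm(n, sd = 0.5),
  AnnouncementsView = engagement + rnorm(n, sd = 1),
  Discussion        = rnorm(n)
)

# correlation matrix of the four quantitative variables
cors = cor(sim)

# keep each pair of variables once (upper triangle of the matrix)
idx = which(upper.tri(cors), arr.ind = TRUE)
pair.table = data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[upper.tri(cors)]
)

# sort so the highest correlation is first and the lowest is last
pair.table = pair.table[order(-pair.table$r), ]
print(pair.table)
```

With four variables there are 6 distinct pairs, which matches the 6 scatterplots in the 4x4 grid.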
RaisedHands
VisitedResources
Grade
Also, using the mapping argument, color the pairs plot by the Gender variable. Make sure that there is some transparency in the plot.
ggpairs(data = data.subset, columns = c(1,2,6), mapping = aes(color = Gender, alpha = 0.7))
After you’ve made your plot, answer the following: In 1-3 sentences, describe the distribution of VisitedResources conditional on Grade and Gender.
From our plot in Part C (in particular, the side-by-side boxplot in the middle row), we can see that the distribution of VisitedResources is similar between genders regardless of the level of Grade. However, we can see that High and Medium grades (H and M, respectively) tend to have high values of VisitedResources; meanwhile, Low grades (L) tend to have low values of VisitedResources.
Hint: This question is NOT asking you to describe the distribution of (1) VisitedResources conditional on Grade, and (2) VisitedResources conditional on Gender. Rather, it’s asking you to describe the distribution of VisitedResources conditional on Grade AND Gender (together).
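To make the "Grade AND Gender (together)" idea concrete, one option is to summarize VisitedResources within each Grade-by-Gender cell. The sketch below uses simulated stand-in data; the data frame sim and the group means in it are assumptions chosen for illustration, not values from the course dataset.

```r
# Simulated stand-in data (hypothetical; NOT the students dataset)
set.seed(315)
n = 300
sim = data.frame(
  Grade  = factor(sample(c("L", "M", "H"), n, replace = TRUE),
                  levels = c("L", "M", "H")),
  Gender = factor(sample(c("F", "M"), n, replace = TRUE))
)

# assumed for illustration: VisitedResources depends on Grade but not on Gender
grade.means = c(L = 30, M = 60, H = 80)
raw = rnorm(n, mean = grade.means[as.character(sim$Grade)], sd = 15)
sim$VisitedResources = round(pmin(100, pmax(0, raw)))

# one row per Grade x Gender combination, i.e. the distribution of
# VisitedResources conditional on both variables at once
cell.summary = aggregate(VisitedResources ~ Grade + Gender, data = sim, FUN = median)
print(cell.summary)
```

Comparing rows that share a Grade shows the Gender effect within that grade level; comparing rows that share a Gender shows the Grade effect within that gender.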
In this problem, we will continue working with the student dataset from Problem 2.
RaisedHands and VisitedResources with contour lines added using geom_density2d().
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) + geom_density2d() + geom_point() +
labs(x = "Raised Hands", y = "Visited Resources")
geom_density2d() estimates these bandwidths by default. Now, copy-and-paste your above code, but make the bandwidth smaller by setting h = c(10, 10) within geom_density2d().
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) + geom_density2d(h = c(10,10)) + geom_point() +
labs(x = "Raised Hands", y = "Visited Resources")
The first plot consists of 2-3 modes, which are divided into many smaller modes in the second plot. For example, in the second plot, the top-right of the scatterplot is still captured by one large mode (similar to the first plot), but that mode is now divided into 2-4 smaller modes. There are also very many “small islands” of modes throughout the plot, denoting very small clusters of points that are similar in terms of these two variables (which we don’t see in the first plot). More generally, making the bandwidth smaller emphasizes many small modes within the data, similar to what we saw with kernel density smoothing for one quantitative variable.
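The effect of the bandwidth can also be checked numerically. According to the ggplot2 documentation, geom_density2d() computes its estimate with MASS::kde2d() and, when h is not supplied, picks the bandwidths with MASS::bandwidth.nrd(). The sketch below compares that default with h = c(10, 10) on simulated two-cluster data; the data and the helper count.modes() are hypothetical and only meant as an illustration.

```r
library(MASS)  # ships with R; provides kde2d() and bandwidth.nrd()

# Simulated two-cluster data on a roughly 0-100 scale (hypothetical)
set.seed(315)
x = c(rnorm(150, mean = 20, sd = 10), rnorm(150, mean = 75, sd = 10))
y = c(rnorm(150, mean = 25, sd = 12), rnorm(150, mean = 80, sd = 12))

# default (normal reference) bandwidths versus the smaller hand-picked ones
h.default = c(bandwidth.nrd(x), bandwidth.nrd(y))
dens.default = kde2d(x, y, h = h.default, n = 50)
dens.small   = kde2d(x, y, h = c(10, 10), n = 50)

# hypothetical helper: count interior grid cells that are strictly
# higher than all 8 of their neighbors (a rough count of modes)
count.modes = function(z) {
  count = 0
  for (i in 2:(nrow(z) - 1)) {
    for (j in 2:(ncol(z) - 1)) {
      nb = z[(i - 1):(i + 1), (j - 1):(j + 1)]
      if (z[i, j] == max(nb) && sum(nb == max(nb)) == 1) count = count + 1
    }
  }
  count
}

print(h.default)
print(c(default = count.modes(dens.default$z), small = count.modes(dens.small$z)))
```

Comparing the two mode counts gives a rough numerical check on what the contour plots show: the smaller bandwidth follows individual clumps of points more closely.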
RaisedHands and VisitedResources with contour lines, but with the following changes:
h = c(80, 80) within geom_density2d()
Grade and the shape of the points according to Gender.
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) + geom_density2d(h = c(80, 80)) +
geom_point(aes(color = Grade, shape = Gender)) +
labs(x = "Raised Hands", y = "Visited Resources")
After you’ve made your plot, answer the following two questions:
RaisedHands and VisitedResources.
From this plot, we can see that there are two modes (one in the bottom left of the plot, and one in the top right of the plot). These two modes can be considered students with “limited engagement” and “a lot of engagement,” respectively (i.e., the first mode denotes students with low levels of RaisedHands and VisitedResources, and the second mode denotes students with high levels of these variables).
Grade and Gender.
It seems like the students with “limited engagement” (i.e., students in the bottom left mode) tend to be male and have Low (L) or Medium (M) grades. Meanwhile, students with “a lot of engagement” (i.e., students in the top right mode) tend to have Medium (M) or High (H) grades; there also appears to be about an equal distribution of genders in this mode.
[As a sidenote: Note that the above is the correct plot. It is less informative to put color and shape within ggplot() itself (see below); this will create a separate set of contour lines for each color and shape combination.]
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources, color = Grade, shape = Gender)) + geom_density2d(h = c(80, 80)) +
geom_point() +
labs(x = "Raised Hands", y = "Visited Resources")
RaisedHands and VisitedResources with points added but no contour lines (using the default bandwidth) with stat_density2d. Furthermore, change the default colors using scale_fill_gradient() and setting the low and high arguments in that function. Be sure that you use geom_point() after you use stat_density2d (otherwise, you won’t be able to see the points).
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) +
stat_density2d(aes(fill = ..density..), geom = "tile", contour = F) +
scale_fill_gradient(low = "white", high = "red") +
geom_point() +
labs(x = "Raised Hands", y = "Visited Resources")
scale_fill_gradient() to scale_fill_gradient2().
scale_fill_gradient2(), specify a “medium density” color using the mid argument (similar to the low and high arguments).
scale_fill_gradient2(), there is an argument called midpoint that specifies what a “medium density” is. The default is 0, which doesn’t make sense for densities, because 0 is the lowest possible value for densities. So, set midpoint equal to a non-zero number that you think makes sense, given the range of densities you saw in your previous heat map.
You should end up with a heat map that looks kind of cool, or at least cooler than the two-color heat map.
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) +
stat_density2d(aes(fill = ..density..), geom = "tile", contour = F) +
scale_fill_gradient2(low = "white", mid = "blue", high = "red", midpoint = 0.00015) +
geom_point() +
labs(x = "Raised Hands", y = "Visited Resources")
Hint: For the midpoint argument, your graph should be a gradient of three different colors that you’ve specified. If this isn’t the case, you may have specified midpoint poorly.
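One way to pick a sensible midpoint is to look at the range of the estimated density values directly. The sketch below does this with MASS::kde2d(), which the ggplot2 documentation names as the estimator behind stat_density2d(). The simulated data and the name midpoint.guess are assumptions for illustration, and the grid here may differ slightly from the one ggplot2 builds, so the result is only a rough guide.

```r
library(MASS)  # ships with R; provides kde2d()

# Simulated two-cluster data on a roughly 0-100 scale (hypothetical)
set.seed(315)
x = c(rnorm(150, mean = 20, sd = 10), rnorm(150, mean = 75, sd = 10))
y = c(rnorm(150, mean = 25, sd = 12), rnorm(150, mean = 80, sd = 12))

# 2D kernel density estimate on a 100 x 100 grid with the default bandwidths
dens = kde2d(x, y, n = 100)

# range of estimated densities: these are roughly the values mapped to fill
dens.range = range(dens$z)

# a simple candidate for a "medium density": halfway between lowest and highest
midpoint.guess = mean(dens.range)
print(dens.range)
print(midpoint.guess)
```

A midpoint far outside this range leaves the heat map showing only two of the three colors, which is the failure the hint above describes.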
We’ll again work with the olive oils dataset used in Lab7. The dataset can be found here and more information about the data can be found here.
Here is the code to define the olive dataset:
olive = read.csv("https://raw.githubusercontent.com/zjbranson/315Fall2021/main/olive.csv")
area or region) to compute that distance. Remember to standardize your variables.
# first, grab just the quantitative variables
olive.subset = subset(olive, select = -c(area, region))
# standardize the variables
olive.subset = apply(olive.subset, MARGIN = 2, FUN = function(x) x/sd(x))
olive.subset = as.data.frame(olive.subset)
#compute the distance matrix
dist.olive = dist(olive.subset)
#run MDS
mds.olive = cmdscale(dist.olive, k = 2)
#add MDS dimensions to the dataset
olive$mds1 = mds.olive[,1]; olive$mds2 = mds.olive[,2]
#scatterplot of MDS dimensions
ggplot(olive, aes(x = mds1, y = mds2)) + geom_point()
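As a quick check on the steps above, the sketch below repeats the standardize, dist(), cmdscale() pipeline on R's built-in iris data (a stand-in chosen so the code runs without a download; it is not the olive dataset). It also asks cmdscale() for its goodness-of-fit values via eig = TRUE, which gives a rough sense of how much of the distance structure two dimensions retain.

```r
# Built-in stand-in for the quantitative columns (NOT the olive dataset)
vars = iris[, 1:4]

# standardize each variable by its standard deviation, as in the code above
vars.std = as.data.frame(apply(vars, MARGIN = 2, FUN = function(x) x / sd(x)))
sds = apply(vars.std, MARGIN = 2, FUN = sd)  # should all equal 1

# distance matrix and classical MDS in 2 dimensions;
# eig = TRUE returns a list that includes the goodness-of-fit (GOF) values
d.orig = dist(vars.std)
mds.fit = cmdscale(d.orig, k = 2, eig = TRUE)

# distances between points in the 2-dimensional MDS configuration
d.mds = dist(mds.fit$points)

print(sds)
print(mds.fit$GOF)
print(cor(as.vector(d.orig), as.vector(d.mds)))
```

Without the standardization step, variables measured on larger scales would dominate the distances and therefore the MDS dimensions.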
Add something to the graph such that you can determine how many modes there are in your Part A scatterplot.
Color the points by area.
ggplot(olive, aes(x = mds1, y = mds2)) + geom_point(aes(color = area)) + geom_density2d()
After you’ve made your plot, answer the following questions:
It appears that there are three main modes (left center, middle center, and right center) as well as two minor modes (bottom center and top left).
The three main modes in the data clearly correspond to the three different areas in the dataset: The left-center mode corresponds to Northern oils, the middle-center mode corresponds to Sardinian oils, and the right-center mode corresponds to Southern oils. Meanwhile, the top-left minor mode corresponds to Northern oils, while the bottom-right minor mode corresponds to Southern oils. Thus, Sardinian oils appear to be tightly clustered, while Northern and Southern oils are also quite clustered but to a less concentrated degree. This is intuitive: Sardinia is an island, while North and South are more general areas of Italy, and thus we would expect there to be more heterogeneity in the oils that come from those regions.
area, region, or \(MDS_2\) (the second dimension from MDS) in your regression.
summary(lm(mds1~.-mds2-area-region, data = olive))
##
## Call:
## lm(formula = mds1 ~ . - mds2 - area - region, data = olive)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.888e-13 -5.200e-16 5.600e-16 1.590e-15 3.104e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.093e+00 3.013e-13 6.946e+12 <2e-16 ***
## palmitic 2.733e-03 3.287e-17 8.314e+13 <2e-16 ***
## palmitoleic 8.577e-03 4.036e-17 2.125e+14 <2e-16 ***
## stearic -2.685e-03 3.777e-17 -7.108e+13 <2e-16 ***
## oleic -1.218e-03 2.999e-17 -4.061e+13 <2e-16 ***
## linoleic 1.506e-03 2.945e-17 5.114e+13 <2e-16 ***
## linolenic 1.689e-02 9.267e-17 1.822e+14 <2e-16 ***
## arachidic 1.036e-02 5.178e-17 2.002e+14 <2e-16 ***
## eicosenoic 2.214e-02 8.485e-17 2.610e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.659e-14 on 563 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.648e+29 on 8 and 563 DF, p-value: < 2.2e-16
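The R-squared of 1 and the residuals near machine precision are what one would expect here: with Euclidean distances, classical MDS coordinates coincide (up to sign) with principal component scores, so \(MDS_1\) is an exact linear combination of the variables it was built from. The signs of the coefficients are still informative, but the standard errors mostly reflect numerical round-off. The sketch below checks this on R's built-in iris data, used as a stand-in rather than the olive dataset.

```r
# Built-in stand-in for the quantitative columns (NOT the olive dataset)
vars = iris[, 1:4]
vars.std = as.data.frame(apply(vars, MARGIN = 2, FUN = function(x) x / sd(x)))

# first dimension from classical MDS on Euclidean distances
mds1 = cmdscale(dist(vars.std), k = 1)[, 1]

# first principal component score of the same standardized variables
pc1 = prcomp(vars.std)$x[, 1]

# the two agree up to sign, so the absolute correlation should be 1
mds.pca.cor = abs(cor(mds1, pc1))

# regress the MDS dimension on the variables that produced it
fit = lm(mds1 ~ ., data = vars.std)

# R-squared computed by hand (summary() warns when the fit is essentially perfect)
r2 = 1 - sum(resid(fit)^2) / sum((mds1 - mean(mds1))^2)

print(mds.pca.cor)
print(r2)
print(coef(fit))
```

Since the sign of an MDS dimension is arbitrary, the positive and negative groups of coefficients could swap if the dimension were flipped; what carries over is which variables fall on the same side.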
From the above, all of the variables are significantly related with \(MDS_1\) (all of the \(p\)-values are quite small). In particular, the following variables are positively associated with \(MDS_1\):
palmitic
palmitoleic
linoleic
linolenic
arachidic
eicosenoic
And these variables are negatively associated with \(MDS_1\):
stearic
oleic
area.
area.
alpha = 0.5 such that there is some transparency in the plot.
library(GGally)
ggpairs(olive, columns = c(3:6), mapping = aes(color = area, alpha = 0.5))
After making your graph, summarize the main takeaways from that graph in 1-4 sentences. In your interpretation, be sure to compare the different areas in terms of each of the four variables you plotted.
We choose palmitic and palmitoleic as the positively-associated variables and stearic and oleic as the negatively-associated variables. Below is a pairs plot colored by area.
By looking at the smoothed density plots along the diagonal, we can see that South has much larger palmitic and palmitoleic values than the other two areas (which do not seem to differ much in these two variables). There do not seem to be large differences in stearic among the three areas, although there is less variance for Sardinia. Meanwhile, South exhibits the lowest oleic values, followed by Sardinia, followed by North. Looking at the scatterplots in the off-diagonal, North and Sardinia appear to be well separated only on oleic; meanwhile, South seems well separated from the other two areas in every respect, but with much more variation in every variable.